Readable workflows need simple data
Authors
Abstract
Sharing scientific analyses via workflows has the potential to improve the reproducibility of research results, because workflows allow complex tasks to be split into smaller pieces and give visual access to the flow of data between the components of an analysis. This is particularly useful for trans-disciplinary research fields such as biodiversity and ecosystem functioning (BEF), where complex syntheses integrate data over large temporal, spatial and taxonomic scales. However, depending on the data used and the complexity of the analysis, scientific workflows can grow very complex, which makes them hard to understand and reuse. Here we argue that enabling simplicity from the beginning of the data life cycle, by adhering to good practices of data management, can significantly reduce the overall complexity of scientific workflows. It can simplify the processes of data inclusion, cleaning, merging and imputation. To illustrate our points, we chose a typical analysis in BEF research, the aggregation of carbon pools in a forest ecosystem. We propose indicators to measure the complexity of workflow components, including the data sources. We illustrate that the complexity decreases exponentially during the course of the analysis and that simple text-based measures can help to identify bottlenecks in a workflow. Taken together, we argue that focusing on the simplification of data sources and workflow components will improve and accelerate data and workflow reuse and improve the reproducibility of data-driven sciences.
Similar resources
Building and Documenting Workflows with Python-Based Snakemake
Snakemake is a novel workflow engine with a simple Python-derived workflow definition language and an optimizing execution environment. It is the first system that supports multiple named wildcards (or variables) in input and output filenames of each rule definition. It also allows writing human-readable workflows that document themselves. We have found Snakemake especially useful for building...
Technical report: CSVM dictionaries
CSVM (CSV with Metadata) is a simple file format for tabular data. The possible application domain is the same as that of typical spreadsheet files, but CSVM is well suited for long-term storage and the inter-conversion of raw data. CSVM embeds different levels for data, metadata and annotations in human-readable format and flat ASCII files. As a proof of concept, Perl and Python toolkits were designe...
From the Desktop to the Grid: conversion of KNIME Workflows to gUSE
The Konstanz Information Miner is a user-friendly graphical workflow designer with a broad user base in industry and academia. Its broad range of embedded tools and its powerful data mining and visualization tools render it ideal for scientific workflows. It is thus used more and more in a broad range of applications. However, the free version typically runs on a desktop computer, restricting u...
Automated protein function prediction - the genomic challenge
Overwhelmed with genomic data, biologists are facing the first big post-genomic question: what do all genes do? First, not only is the volume of pure sequence and structure data growing, but its diversity is growing as well, leading to a disproportionate growth in the number of uncharacterized gene products. Consequently, established methods of gene and protein annotation, such as homology-base...
Scientific Workflows: Business as Usual?
Business workflow management and business process modeling are mature research areas, whose roots go far back to the early days of office automation systems. Scientific workflow management, on the other hand, is a much more recent phenomenon, triggered by (i) a shift towards data-intensive and computational methods in the natural sciences, and (ii) the resulting need for tools that can simplify...
Publication date: 2016